NSF PAR Search | NSF Public Access Repository


Discussion of “Data Fission: Splitting a Single Data Point”

https://doi.org/10.1080/01621459.2024.2421998

Neufeld, Anna; Dharamshi, Ameer; Gao, Lucy L; Witten, Daniela; Bien, Jacob (January 2025, Journal of the American Statistical Association)

Full Text Available
Generalized data thinning using sufficient statistics

https://doi.org/10.1080/01621459.2024.2353948

Dharamshi, Ameer; Neufeld, Anna; Motwani, Keshav; Gao, Lucy L; Witten, Daniela; Bien, Jacob (May 2024, Journal of the American Statistical Association)

Full Text Available
Data thinning for convolution-closed distributions

Neufeld, Anna; Dharamshi, Ameer; Gao, Lucy; Witten, Daniela (March 2024, Journal of machine learning research)

Full Text Available
Selective Inference for Hierarchical Clustering

https://doi.org/10.1080/01621459.2022.2116331

Gao, Lucy L.; Bien, Jacob; Witten, Daniela (October 2022, Journal of the American Statistical Association)

Full Text Available
Sparse Reduced Rank Huber Regression in High Dimensions

https://doi.org/10.1080/01621459.2022.2050243

Tan, Kean Ming; Sun, Qiang; Witten, Daniela (April 2022, Journal of the American Statistical Association)

Full Text Available
Optimal estimation of variance in nonparametric regression with random design

https://doi.org/10.1214/20-AOS1944

Shen, Yandi; Gao, Chao; Witten, Daniela; Han, Fang (December 2020, Annals of Statistics)
Full Text Available
Exponential inequalities for dependent V-statistics via random Fourier features

https://doi.org/10.1214/20-EJP411

Shen, Yandi; Han, Fang; Witten, Daniela (January 2020, Electronic Journal of Probability)

Full Text Available
Adaptive nonparametric regression with the K-nearest neighbour fused lasso

https://doi.org/10.1093/biomet/asz071

Madrid Padilla, Oscar Hernan; Sharpnack, James; Chen, Yanzhen; Witten, Daniela M (January 2020, Biometrika)

Summary: The fused lasso, also known as total-variation denoising, is a locally adaptive function estimator over a regular grid of design points. In this article, we extend the fused lasso to settings in which the points do not occur on a regular grid, leading to a method for nonparametric regression. This approach, which we call the $$K$$-nearest-neighbours fused lasso, involves computing the $$K$$-nearest-neighbours graph of the design points and then performing the fused lasso over this graph. We show that this procedure has a number of theoretical advantages over competing methods: specifically, it inherits local adaptivity from its connection to the fused lasso, and it inherits manifold adaptivity from its connection to the $$K$$-nearest-neighbours approach. In a simulation study and an application to flu data, we show that excellent results are obtained. For completeness, we also study an estimator that makes use of an $$\epsilon$$-graph rather than a $$K$$-nearest-neighbours graph and contrast it with the $$K$$-nearest-neighbours fused lasso.
Full Text Available
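
The two steps named in the summary above (build the K-nearest-neighbours graph of the design points, then apply the fused lasso over that graph) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `knn_edges` and `fused_lasso_objective` are hypothetical, the graph is assumed to join two points when either is among the other's K nearest, and the objective is assumed to be the standard graph fused lasso form, a squared-error fit plus a penalty on differences across edges. The sketch only evaluates the objective; it does not solve the optimisation problem.

```python
import math


def knn_edges(points, k):
    """Return the undirected edge set of a K-nearest-neighbours graph.

    Assumption: i and j are joined if either is among the other's k nearest.
    """
    edges = set()
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def fused_lasso_objective(theta, y, edges, lam):
    """Evaluate 0.5 * sum (y_i - theta_i)^2 + lam * sum_{(i,j)} |theta_i - theta_j|."""
    fit = 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, theta))
    penalty = sum(abs(theta[i] - theta[j]) for i, j in edges)
    return fit + lam * penalty


# Irregularly spaced one-dimensional design points with a step-like response.
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
y = [1.0, 1.0, 3.0, 3.0]
edges = knn_edges(points, k=1)

# Interpolating the data pays only the penalty; a constant fit pays only the fit term.
obj_interpolate = fused_lasso_objective(y, y, edges, lam=2.0)
obj_constant = fused_lasso_objective([2.0] * 4, y, edges, lam=2.0)
```

With a larger `lam` the penalty term dominates and piecewise-constant fits over the graph are favoured; an actual estimator would minimise this objective over `theta` with a convex solver.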
Controlling costs: Feature selection on a budget

https://doi.org/10.1002/sta4.427

Yu, Guo; Witten, Daniela; Bien, Jacob (March 2022, Stat)

The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost‐conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.
Are clusterings of multiple data views independent?

https://doi.org/10.1093/biostatistics/kxz001

Gao, Lucy L; Bien, Jacob; Witten, Daniela (February 2019, Biostatistics)

Full Text Available
